5728-5741 Nucleic Acids Research, 2014, Vol. 42, No. 9 
doi: 10.1093/nar/gku212 



Published online 20 March 2014 



Large-scale analysis of tandem repeat variability in 
the human genome 

Jorge Duitama 1,2,t , Alena Zablotskaya 3,4 , Rita Gemayel 1 , An Jansen 1,3,4 , Stefanie Belet 3 4 , 
Joris R. Vermeesch 5 , Kevin J. Verstrepen 1 and Guy Froyen 3,4 

1 VIB lab for Systems Biology & CMPG Lab for Genetics and Genomics, KU Leuven, B-3001 Leuven, Belgium, 
2 Agrobiodiversity Research Area, International Center for Tropical Agriculture (CIAT), Cali, Colombia, 3 Human 
Genome Laboratory, VIB Center for the Biology of Disease, Leuven, Belgium, 4 Human Genome Laboratory, 
Department of Human Genetics, KU Leuven, B-3000 Leuven, Belgium and 5 Center for Human Genetics, University 
Hospitals Leuven, KU Leuven, B-3000 Leuven, Belgium 

Received November 1, 2013; Revised February 27, 2014; Accepted February 28, 2014 



ABSTRACT 

Tandem repeats are short DNA sequences that are re- 
peated head-to-tail with a propensity to be variable. 
They constitute a significant proportion of the hu- 
man genome, also occurring within coding and reg- 
ulatory regions. Variation in these repeats can alter 
the function and/or expression of genes allowing or- 
ganisms to swiftly adapt to novel environments. Im- 
portantly, some repeat expansions have also been 
linked to certain neurodegenerative diseases. There- 
fore, accurate sequencing of tandem repeats could 
contribute to our understanding of common pheno- 
typic variability and might uncover missing genetic 
factors in idiopathic clinical conditions. However, de- 
spite long-standing evidence for the functional role 
of repeats, they are largely ignored because of tech- 
nical limitations in sequencing, mapping and typing. 
Here, we report on a novel capture technique and 
data filtering protocol that allowed simultaneous se- 
quencing of thousands of tandem repeats in the hu- 
man genomes of a three generation family using GS- 
FLX-plus Titanium technology. Our results demon- 
strated that up to 7.6% of tandem repeats in this fam- 
ily (4% in coding sequences) differ from the reference 
sequence, and identified a de novo variation in the 
family tree. The method opens new routes to look 
at this underappreciated type of genetic variability, 
including the identification of novel disease-related 
repeats. 



INTRODUCTION 

Repetitive DNA sequences make up a significant portion of 
all genomes. Almost half of the human genome is comprised 
of repeats (1). A subset of repeated DNA is composed of 
tandem repeats, which are stretches of DNA that consist of 
tandemly repeated short sequence units (e.g. CAG) next to 
each other. The terms microsatellites and minisatellites are 
also frequently used to denote tandem repeats of short (< 
9bp) or long (> 9bp) repeated units, respectively (for a com- 
plete list of terms used in this manuscript, see Supplemen- 
tary Text SI). Tandem repeats can be mutational hotspots 
due to their repetitive nature; slippage during DNA repli- 
cation or recombination events generate alleles that differ 
in the number of repeated units (called 'copy numbers'). 
Compared to other genomic loci, the mutation rates of tan- 
dem repeats are 10 to 10 000 fold higher (2). Because of 
this instability and apparent lack of genetic information, 
most tandem repeats were thought to be devoid of direct 
biological function and termed 'junk' DNA (3). Tandem 
repeats did prove extremely useful as genetic markers in 
fine-scale genotyping and forensics. They also provide an 
added advantage to genome-wide linkage studies because 
of their higher diversity compared to single nucleotide poly- 
morphisms (SNPs) (4). In certain cancers, the mutational 
spectrum of microsatellites appears to be tumor-type spe- 
cific, thus opening new avenues for the use of microsatellites 
as genetic markers for disease diagnosis (5). 

While many tandem repeats (also called 'repeats' fur- 
ther on) are present in gene deserts, the accumulation of 
whole genome sequencing data showed that repeats are also 
present in functional (coding and regulatory) regions of the 
genomes. 

Past research demonstrated that tandem repeats located 
within regulatory or coding regions can act as variable "tun- 
ing knobs" that can tune the function or expression of a 
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gene (6). Most of this research was focused on simple model 
organisms such as Saccharomyces cerevisiae (7). However, 
recent findings suggest that repeats are equally important 
sources of phenotypic variability and disease in higher eu- 
karyotes, including plants, metazoans, mammals and hu- 
mans (8,9). For example, an intriguing study by Fondon and 
Garner (10) uncovered a strong correlation between varia- 
tion in repeats located in two key regulatory genes (Alx4 
and Runx2) and skeletal morphology in dogs, suggesting 
that repeat variation in these genes may affect skull shape. 
Moreover, instability in such coding or regulatory repeats 
can have devastating consequences for human health. There 
are several examples where expansion of a repeat close or 
even within a human gene leads to severe diseases: Hunt- 
ington disease, fragile X syndrome, and spinal and bulbar 
muscular atrophy are among them (1 1-13). In all these pro- 
gressive disorders the severity and the age of onset of symp- 
toms are directly correlated with repeat copy number in a 
particular gene, since allelic differences in tandem repeat 
copy numbers can influence allelic expression, e.g. in case of 
the ATRXgem (14). In addition, tandem repeat variability 
in certain genes (e.g. Thymidylate Synthase gene) are asso- 
ciated with a poor prognosis in a number of cancers (15,16). 
However, despite the high number of genes potentially af- 
fected by tandem repeats, and despite the growing evidence 
of the functional role of variable repeats, most studies that 
report genetic variation do not consider repeat variability 
and only focus on SNPs and copy number polymorphisms 
of larger segments (>1 kb up to few Mb). 

The ubiquitous presence of repeats in functional parts 
of genomes in spite of the potentially devastating conse- 
quences of their variability suggests that repeats might also 
serve a beneficial functional role. Moreover, tandem repeats 
are not randomly present in coding sequences. Genes that 
encode regulatory, cell-wall and stress-induced proteins are 
particularly enriched in repeats, whereas genes encoding 
metabolic enzymes are depleted. Strikingly, this functional 
enrichment is evolutionary conserved from yeasts to hu- 
mans (2,8,17). As several reports documented, variable tan- 
dem repeats can provide functional diversity allowing rapid 
evolution of phenotypes. In S. cerevisiae, gradual changes 
in intragenic tandem repeats in the FLOl gene (coding for a 
cell-surface protein) lead to gradual changes in the adhesion 
properties of the cells, allowing tuning of biofilm formation 
(2). Similarly, variable tandem repeats in promoters allow 
fine-tuning and rapid divergence of downstream gene ex- 
pression (7,18,19). In the human genome, evidence suggests 
that some tandem repeat polymorphisms are under positive 
selection in certain parts of the world (20). 

Despite their prevalence in functional parts of genomes 
and their association with almost 20 human neurological 
diseases, and despite their usefulness as genetic markers, 
tandem repeats remain understudied. A precise mapping 
of all tandem repeats in the human genome is not fully 
achieved, and the relevant literature often contains conflict- 
ing results. For example, the lobSTR software (21) called 
45 461 microsatellite loci from a trio of sequenced genomes, 
far less than the 376 685 microsatellites detected by another 
study using the same genome (22). This is mostly due to 
the technical difficulties associated with sequencing repeats. 
Although improved methods are being introduced, the cur- 



rent standard next generation sequencing (NGS) techniques 
are not adequate for tandem repeat genotyping because 
reads with short lengths cannot be confidently aligned to 
genomic regions with tandem repeats. This technical short- 
coming implies that biologically important variations are 
being missed in many of today's genomics studies. As a re- 
sult, even basic aspects of tandem repeats, such as the degree 
of variability between individuals and whether this (non- 
pathological) variability has functional consequences, re- 
main open questions. Such research questions require long 
reads (>300 bp) that are currently restricted to the rela- 
tively low-throughput NGS methods GS-FLX and SMRT 
(Roche). 

Although new analysis software packages specifically de- 
signed for genotyping tandem repeats from short reads have 
been recently published (21,23), they are only able to geno- 
type repeats with short unit length and low copy numbers 
from Illumina whole -genome sequencing libraries. Here, we 
set out to develop a strategy that permits a more accurate 
mapping of tandem repeats and also allows better assess- 
ment of repeat variability between individuals. We report a 
method based on targeted enrichment for tandem repeats 
in the human genome, followed by sequencing using the 
Roche 454 platform. Using this method, we were able to 
reliably genotype over 1600 tandem repeats in seven mem- 
bers of a three -generation family. We performed extensive 
polymerase chain reaction (PCR) validation experiments to 
confirm the accuracy of our method and we investigated the 
properties of tandem repeats that can be targeted using this 
technology. Our genotype calls reveal a surprising degree 
of variability within tandem repeats, even within repeats lo- 
cated in coding regions and between direct relatives. 

MATERIALS AND METHODS 

Selection of DNA samples 

The protocol was approved by the Medical Ethical Com- 
mission of the University Hospitals Leuven (Belgium) with 
number B322201 1 1 1336. The family was chosen because of 
the availability of sufficient DNA from seven family mem- 
bers in three generations. Informed consent was obtained 
from each family member or their parents. Genomic DNA 
was extracted from peripheral blood according to standard 
procedures. The pedigree of the selected family is shown in 
Figure 1 and includes the grandparents (grandfather: GF, 
and grandmother: GM), their daughter (mother: Mo) and 
her husband (father: Fa), and the three children (Chi, Ch2, 
Ch3). This pedigree thus has four trios, which allowed us to 
look at de novo mutations in tandem repeats and at general 
variation compared to the reference genome and within the 
family. 

Selection of tandem repeats 

We downloaded the last human reference genome (hgl9) 
from the UCSC server (24). To retrieve regions with tan- 
dem repeats in this reference, we ran the tool ETANDEM 
available in the EMBOSS package (18) with default param- 
eters. ETANDEM calculates a consensus sequence for a pu- 
tative repeat region and scores potential repeats based on 
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Figure 1. Pedigree of the three-generation family consisting of seven fam- 
ily members from whom the tandem repeats were sequenced. The four trios 
are indicated with tl, t2, t3 and t4. In the first 454 runs, samples GF, GM, 
Fa and Mo were sequenced. In the second run, samples Mo, Chi , Ch2 and 
Ch3 were analyzed. GF: grandfather; GM: grandmother; Fa: father; Mo: 
mother; Chl-3: child 1-3. 



the number of matches and mismatches there are to the con- 
sensus: the score is incremented (+ 1 ) for a match and decre- 
mented (—1) for a mismatch (http://emboss.sourceforge. 
net/apps/release/6.0/emboss/apps/etandem.html). Because 
ETANDEM does not predict mononucleotide repeats (re- 
peats with unit length equal to one), we downloaded from 
the UCSC server annotated repeats predicted by the Tan- 
dem Repeats Finder (TRF) tool (25), and we selected all the 
mononucleotide repeats from this dataset. We included the 
mononucleotide repeats in order to obtain the most com- 
prehensive set of repeats irrespective of the NGS platform 
and associated software. However, the use of 454 GS-FLX 
for sequencing, which allowed us to obtain long reads, did 
not allow us to perform reliable genotyping of mononu- 
cleotide repeats in the downstream analysis. We also down- 
loaded the track of microsatellites, which are mostly repeats 
with unit length two and three with high variability (25). 
Finally, we considered repeats known to be variable from 
three additional sources: repeats used for genotyping and 
forensic applications (see: http://www.cstl.nist.gov/biotech/ 
strbase/), repeats related to diseases (8) and repeats stud- 
ied by Matsumoto et al. (26). We classified all these repeats 
based on their location relative to the catalog of annotated 
genes in the following categories: coding, intron, 5 ' UTR, 
3' UTR, upstream (1 kb before the 5' UTR) and down- 
stream (300 bp after the 3' UTR). We also created sepa- 
rate categories for genes spanning annotated microRNAs, 
regulatory elements (transcription factor binding sites) an- 
notated in the ORegAnno database (27) and CpG Islands 
annotated also in the UCSC database (28). Repeats that did 
not span any of these functional elements were classified as 
intergenic. 

To select the set of targeted repeats for sequencing, we 
first set the maximal repeat length (total length of a repeat 
region, as observed in the reference genome) arbitrarily to 



250 bp, which retained 96% of the repeats gathered (792 
394) as explained above. To predict genotype variability, 
we calculated a 'variability score' for each repeat using the 
SERV algorithm (17). This algorithm uses a support vector 
machine to provide for each repeat a score, which is based 
on unit length, total repeat length and purity. The pheno- 
typic variability is predicted as follows: we classified the re- 
peats in two groups, one having only repeats in deep intronic 
sequences or intergenic regions (called the RI group), and 
the other having repeats located within functional regions 
(termed the RF group), including coding and potential reg- 
ulatory regions (5 ' UTR, 3 ' UTR, upstream, downstream, 
microRNA, transcription factor binding sites and CpG Is- 
lands). It is generally assumed that repeats in the RF group 
are more likely to produce phenotypic variation when com- 
pared to those in the RI group. 

From the RF group (33 807), we selected repeats for 
which we could design at least two unique probes, and 
which belong to at least one of the following subgroups: (i) 
mononucleotide repeats (211); (ii) repeats with SERV score 
equal or greater than 1 (2299); (iii) repeats in transcription 
factor binding sites (1090); (iv) repeats in coding regions 
with ETANDEM score equal or larger than 21 (3889) and 
(v) repeats with ETANDEM score larger than 45 (2093). 
The scores for the latter two RF subgroups were arbitrar- 
ily selected to yield a total number of selected repeats that 
would be feasible to sequence, when added up with those se- 
lected for the RI group. The final number for the RF group 
(as a union) here was 7724. 

Because the total size of the RI group is much larger 
than the size of the group RF and repeats in this group 
are less likely to produce phenotypic variation, we applied 
more stringent rules to select repeats from the RI group. 
First, we selected repeats for which we can design three or 
more probes, with SERV score equal to or larger than 1 , and 
with ETANDEM score larger than 100 (75 for mononu- 
cleotides). We selected a total of 673 repeats from RI fol- 
lowing these conditions. Second, to compare variability be- 
tween the RF and the RI group for different kind of repeats, 
we built a histogram with the selected repeats in the RF 
group based on their unit length, copy number and GC con- 
tent. Then, from all repeats in RI for which we can design 
at least three probes, we selected 2338 random repeats fol- 
lowing the probability distribution defined by the histogram 
and added them to the list of 673 RI repeats (details can be 
provided upon request). 

We completed the set of selected tandem repeats in the 
RF (7724) and RI (301 1) groups by adding the repeats from 
the forensic, disease related datasets and the repeats ana- 
lyzed by Matsumoto and colleagues (26) for which we could 
design at least one probe: 23, 16 and 6 repeats, respectively. 

Probes design for selected tandem repeats 

For each candidate probe, we used the following two strate- 
gies to predict specific hybridization: first, we called a probe 
not unique if a stretch of more than 25 consecutive bases, 
not including the repeat, spans a region that is masked in 
the reference genome by RepeatMasker. Second, for each 
probe passing the previous filter, we simulated three con- 
secutive fragments corresponding to the first, middle and 
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last 40 bp of the probe and we mapped those to the ref- 
erence genome using Bowtie (29). We used a base quality 
score of 20 for each base of each simulated read and asked 
Bowtie to retain alignments with a sum of mismatch quali- 
ties less than or equal to 200, which retains alignments with 
up to 10 mismatches. We also set a seed length of 15 bp 
and allowed each alignment to have up to three mismatches 
in the seed. We finally set the 'try hard' option to find as 
many good alignments as possible. Given the Bowtie align- 
ments, we calculated the number of mismatches of the sec- 
ond best hit for each fragment of each probe. We finally 
called a probe unique if the number of mismatches of each 
fragment with its second best alignment is greater than 3 
and the total sum of mismatches over the whole probe is 
greater than 19. This procedure ensures that there will be 
at least 20 well-distributed mismatches with the second best 
hit of each probe that we call unique. For spanning probes, 
we also asked that the unit length of the corresponding re- 
peat is at least 3, and for the repeats with unit length equal 
to 3, the total repeat length is at most 80 to avoid hybridiza- 
tion to undesired di- or trinucleotides with the same motif 
of a targeted repeat. 

Capture and sequencing of tandem repeats 

Library preparation was done following the manufacturer's 
instructions by the Genomics Core of the University Hospi- 
tals Leuven (http://gc.uzleuven.be/). Briefly, genomic DNA 
of each family member was sonicated in a Bioruptor (Di- 
agenode) to yield fragments of 200-1500 bp, which was size- 
selected on agarose gels to yield DNA bands of 500-800 
bp. Selected fragments were blunt-ended, adapters were lig- 
ated and then purified again. SureSelect (Agilent) sequence 
capture was performed by the Genomics Core according to 
the manufacturer's protocols. Four samples were loaded in 
a four-lane gasket of the GS FLX+ instrument (Roche) and 
run at one lane per sample with the GS-FLX Titanium XL+ 
kit (Roche), which produces single-end reads up to 600 bp. 

Bioinformatic analysis of sequence data 

We aligned reads to the reference genome using bwa-sw with 
default parameters (30). We implemented a custom script to 
perform local alignment and identify the 45-bp long flank- 
ing regions in the reads mapping next to targeted repeats, 
and extract the segment corresponding to the repeat region. 
We also implemented a script that gathers all reads genotyp- 
ing the same repeat in different samples, performs multiple 
sequence alignment, and corrects for homopolymer errors. 
After corrections, the script again splits the reads over the 
different samples and performs independent genotype call- 
ing by choosing the two alleles with the highest read counts. 
We set a minimum threshold of two reads for the second 
most frequent allele to call a genotype heterozygous. Soft- 
ware information for this type of analysis is available as 
Supplementary Methods SI. For each family member, we 
thus generated five data points for each repeat: (i) the copy 
number of the most frequent allele (CI), (ii) the number of 
reads of that allele (Rl), (hi) the copy number of the second 
(most frequent) allele (C2), (iv) the number of reads of that 
second allele (R2) and (v) the total number of reads (RT) 



for that repeat. In most cases, RT = Rl + R2, although 
RT could be >R1 + R2 because of the presence of single 
reads, which are not called but included in the RT count, 
or because of the presence of alleles of an apparent third 
allele with a lower read number than R2, or slippage asso- 
ciated with the sequencing protocol. Polymorphism in cod- 
ing repeats was analyzed with the PROVEAN Protein tool 
(31). PROVEAN calculates a delta alignment score based 
on the reference and variant versions of a protein query se- 
quence with respect to sequence homologs collected from 
the non-redundant (NR) protein database at the National 
Center for Biotechnology Information (NCBI) through Ba- 
sic Local Alignment Search Tool. If the PROVEAN score is 
equal to or below a predefined threshold, the protein vari- 
ant is predicted to have a 'deleterious' effect, otherwise it is 
predicted as 'neutral'. 

Validation of sequence data 

The genotypes of the family members obtained by 454 se- 
quencing were validated by the method of fragment analy- 
sis. For that, two rounds of PCR were performed followed 
by capillary electrophoresis. The first PCR consisted of 15 
cycles performed on 50 ng genomic DNA in a 25 |xl mix- 
ture using GoTaq Flexi DNA Polymerase (Promega) and 
0.2 |xM unlabeled specific primers designed in CLC Main 
Workbench (CLC bio, Denmark). A 21 bp extension of the 
M13 primer sequence was added to the 5 '-end of each for- 
ward primer. All primer sequences can be obtained upon 
request. The second PCR consisted of 20 cycles and was 
performed on 2 jjlI of the first PCR product using a FAM- 
labeled Ml 3 primer in combination with each respective re- 
verse primer. Final products were mixed with the GeneS- 
can 500 ROX Size Standard (Lifetechnologies) and sub- 
jected to capillary electrophoresis on an ABI3500xL Ge- 
netic Analyzer (Lifetechnologies). Fragment lengths were 
determined with the GeneMapper v4.1 software (Lifetech- 
nologies). Conclusions on allele calls were made only by 
comparing the three samples of a trio in the same run. To 
verify potential de novo mutations, we performed Sanger 
sequencing as follows: 35 cycles were performed with the 
respective unlabeled primers and the BigDye v3.1 cycle se- 
quencing kit, analyzed on the ABI3500xL instrument. In 
case of heterozygosity, however, we first ligated the purified 
PCR product in the pGEM-T Easy vector (Promega) and 
sequenced clones with the different alleles. Sequences were 
analyzed and aligned in the BioEdit v. 7.1 software (Ibis 
Biosciences). 

RESULTS 

A novel strategy for targeted sequencing of tandem repeats 

We developed a novel strategy to sequence thousands of re- 
peats in the human genome, which can be summarized in 
three main steps. First, we aimed to capture specific repeats 
from sheared total genomic DNA. These fragments are sub- 
sequently sequenced using the 454 GS-FLX-Plus Titanium 
system, a technology that yields the long read lengths (>500 
bp) required to span complete tandem repeats plus some of 
the flanking regions. In a final step, the reads are mapped 
onto the reference genome to identify the specific repeat, 
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and subsequently the reads are filtered and analyzed to yield 
an accurate estimate of the total repeat length. 

We started gathering 792 394 predicted and validated tan- 
dem repeats in the current reference genome (hgl9) from 
a wide variety of sources: ETANDEM predictions, UCSC 
tracks, databases for genotyping and forensic applications, 
and previous reports describing disease-related or highly 
variable repeats (see Materials and Methods for details). 
For each tandem repeat we predicted its most likely genetic 
annotation based on the UCSC genome browser (24). Ta- 
ble 1 summarizes the number of repeats gathered from the 
different sources and classified in the different genetic an- 
notation classes. We found tandem repeats of lengths <250 
bp in the coding region of 4180 genes and in the promoter 
of 5859 additional genes, which represents, respectively, 9% 
and 12.61% of the catalog of 46 454 protein coding genes 
available in the UCSC database. Supplementary Figure SI 
shows the distribution of repeats for different unit lengths. 
In contrast with other categories, repeats in coding regions 
often have units that contain multiples of three nucleotides. 
This finding is expected given the variability of repeats and 
selection against frameshift mutations. We estimated the de- 
gree of variability for each repeat using the SERV model 
(17). Predictions performed with this model indicate that 
repeats in coding regions are on average less variable than 
repeats in other regions, even though as many as 873 re- 
peats located within coding regions are predicted to be ex- 
tremely variable (VARScore > 1 ) and more than 3000 cod- 
ing repeats are predicted to be highly variable (VARScore 
between 0 and 1) (Supplementary Figure S2). 

Because we can genotype only a few thousand repeats 
with the reads produced by a single run of the GS FLX+ 
system, we designed a set of rules to select tandem repeats 
for which (i) we can design at least two different unique 
probes, (ii) a total repeat length can be covered by a GS 
FLX+ read including the flanking sequences, (iii) there is a 
high probability of being variable among different individ- 
uals and (iv) there is a potential to induce phenotypic vari- 
ability (see Materials and Methods for details). We selected 
7728 repeats in presumed functional domains of which 3891 
are in coding genes, 245 are in noncoding RNA and 1836 
are in promoter or terminator regions. The remaining 1756 
repeats are in intergenic regions with a predicted regulatory 
function according to UCSC annotations. Together, this set 
is referred to as 'repeats in functional regions (RF group)'. 
A second set of 3018 repeats located in gene deserts and 
deep intronic sequences (RI group) and with the same dis- 
tribution as the presumed functional repeats was also in- 
cluded, yielding a total of 10 746 repeats targeted for se- 
quencing. This total set covers a wide range of unit lengths, 
copy numbers, GC content and functional characteristics 
(see Supplementary Table SI for detailed information of 
each repeat). 

Because tandem repeat sequences are by nature not 
unique, they cannot be specifically captured using tradi- 
tional approaches. Hence, we designed a novel sequence 
strategy to capture the (unique) sequences that flank these 
target repeats and purify genomic DNA fragments that con- 
tain one of the targeted repeats. For the enrichment of the 
selected tandem repeats, we used custom-designed 120 nt 
RNA capture probes. For each repeat in our initial database, 



we defined three classes of probes: (i) flanking probes ('left' 
and 'right'), which bind the region immediately flanking the 
tandem repeat including 20 nt inside the repeat; (ii) 'special' 
probes, carrying 60 nt from both flanking sides of the repeat 
and (iii) 'spanning' probes, which include the repeat itself as 
well as part of the flanking regions (Figure 2). To maximize 
the success rate of our capture strategy, we ensured that all 
designed probes will uniquely hybridize in the genome (see 
Materials and Methods for details). To balance the number 
of probes per repeat, we included unique flanking probes 
twice if only two different probes could be designed, or if 
no spanning probe could be designed. All spanning probes 
were also added two times. Finally, because it is well known 
that probes with a too low (<40%) or too high (>70%) GC 
content are less effective for DNA capture, we also included 
every such probe twice. As a result, a minimum of four and 
a maximum of seven probes (of two to four different types) 
were present for each repeat, resulting in a total of 54 752 
probes for the 10 746 selected tandem repeats. 

To assess the effectiveness of the different types of probes 
that we designed, we selected 257 random repeats from the 
dataset of repeats in RF for which only one type of probe 
could be defined ( 1 14 with a left probe, 52 with a right probe, 
48 with a spanning probe and 43 with a special probe). Each 
probe was included four times. We also selected 172 random 
repeats from the dataset of repeats in RF for which both 
flanking probes could be designed and we included each 
flanking probe twice. 

Genotyping accuracy and efficiency 

Sequence capture was done with the SureSelect kit from Ag- 
ilent. For NGS, the GS FLX+ system is expected to yield 
about 700 million sequenced nucleotides per single run with 
an accuracy of 99.997%. We simultaneously sequenced four 
samples per run using the four-lane gasket, so we did not 
need to tag the samples. In the first run, we sequenced the 
samples GF, GM, Mo and Fa. In the second run, the sam- 
ples were Mo, Chi , Ch2 and Ch3. The sample of the mother 
was again included in the second run because the number 
of reads was lower compared to the other three samples. 
For each sample we obtained on average 200 000 reads (161 
340-228 591) with an average read length of 405 bp (Table 
2). The distribution of read lengths is given in Supplemen- 
tary Figure S3. Over 95% of the reads aligned to a unique 
location in the reference genome. About 60% of the reads 
aligned close to or in a targeted repeat region. For about 
40% of these reads (25% of the total) we could reliably iden- 
tify the two 45 bp long sequences flanking the targeted re- 
peat and genotype the repeat region using the sequence be- 
tween the flanks. We thus obtained 338 046 reads useful for 
genotyping of our selected tandem repeats, with an average 
read length of 501 bp (Table 2), which represents an aver- 
age of 48 292 reads per sample (40 929-63 279). Sequenc- 
ing reads are available in the NCBI Sequence Read Archive 
(http://www.ncbi.nlm.nih.gov/sra) under accession number 
SRP033260. 

We obtained at least one useful read that allowed to make 
allele calls for 6161-6632 tandem repeats (57.3-61.7%) per 
sample, >4 reads for 3233-3957 repeats (30.1-36.8%) and 
>10 reads for 1171-2285 tandem repeats (10.9-21.3%). Of 
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Table 1. Number of repeats gathered from different sources and classified using UCSC annotations 



TRF Unit 



UCSC 



Combined 



Categories 


ETANDEM 


Length = 1 


Microsatellites 


Forensic 


Matsumoto 


Disease 


Datasets 


Intergenic 


414 188 


34 543 


22 218 


31 


77 


0 


438 833 


Intron 


294 430 


33 593 


17 830 


17 


59 


4 


319 754 


TFBS 


1495 


60 
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Figure 2. Schematic of probes design for an example repeat of total length 80 bp (striped box). The length of all probes is 120 nt. The flanking probes on 
the left and right (red and green bars, respectively) are composed of 20 nt inside the repeat and 100 nt of unique flanking sequence. The special probe (split 
blue bar) mixes together the 60 nt of unique sequences flanking the repeat at the left and right. The spanning probe (orange bar) covers the whole repeat 
plus a few unique bases in the flanks, nt: nucleotides. 



those with at least one read, a mean coverage which we 
calculated as an average number of useful reads obtained 
per locus was between 6.3 (for Chi) and 9.8 (for GM). 
Analysis of the percentage of targeted repeats that were se- 
quenced demonstrated only a modest effect with increas- 
ing of repeat lengths (Supplementary Figure S4). The likeli- 
hood that longer repeats are fully sequenced is smaller than 
those of shorter ones but still is about 50% for repeat lengths 
between 200 and 250 bp. 

The distribution of the 'number of repetitive loci' for dif- 
ferent coverage levels (Supplementary Figure S5) indicates 
that the reads are not evenly distributed among the tar- 
geted repeat regions and that we did not obtain any reads 
for about 2500 repeats. Because this behavior is consistent 
among the different samples (Supplementary Figure S6), we 
investigated the main factors that determine the coverage 
of a targeted repeat. First, despite the doubling of probes 
with extreme GC content, repeats with GC content <20% 
or >60% (Figure 3 A) and repeats with total lengths >200 
(Figure 3B) in general show lower coverage (P-value of a 



Wilcoxon rank test <10 -16 ). Further analysis revealed that 
the capture efficiency increases with the number of differ- 
ent probe types (Figure 3C). It is of interest to note that the 
inclusion of 'special' probes seems to greatly increase the 
capturing efficiency. However, the coverage obtained for the 
429 control repeats with only one type of probe present (in 
four-fold) was not different between the various probe types 
(data not shown). Finally, the coverage was also correlated 
with the functional classification, with repeats in CpG is- 
lands and 5' UTRs showing less coverage than repeats in 
other regions (Figure 3D). It is important to note, however, 
that we cannot assess whether these biases are due to differ- 
ences in capture efficiency, or in sequencing. 

For each tandem repeat, we calculated the copy number 
in both alleles of each individual (eg. CAGCAGCAG are 
three copies of a CAG repeat) (see Materials and Meth- 
ods for details). Because the seven family members make 
four different trios, we used Mendelian inheritance as an 
initial cross-validation of the genotype calls. For each trio, 
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Figure 3. Coverage for repeats with (A) different GC content, (B) different total repeat lengths, (C) different numbers of distinct probes, and (D) different 
functional roles. Data are provided as mean ± SE for the coverage between different repeats within each discriminating feature represented in the different 
panels. 



we could test Mendelian inheritance on about 5200 repeats 
and observed Mendelian consistency in about 88% of those. 

Validation of sequence data and sequencing output filtering 

To verify the apparent Mendelian inconsistencies, we per- 
formed validation of 66 repeats by fragment analysis. In the 
same way, we analyzed an additional 42 repeats for which no 
Mendelian inconsistencies were detected in any of the trios. 
Together, this also allowed us to assess the quality of our 
genotyping algorithm. The fragment lengths for each repeat 
were compared with the copy number obtained by GS-FLX 
sequencing. We found that most of the genotyping errors 
in our sequencing strategy were caused by 'PCR stutters' 
produced at the PCR amplification step during the library 
preparation. PCR stutters due to slippage are expected to 
be most frequent for those repeats having a short repeti- 
tive unit and a high copy number (32). Moreover, PCR stut- 
tering usually results in a PCR product that is one or two 
copies shorter or longer than the actual repeat, and the PCR 
yield of the stutter product is typically much lower than that 
of the PCR product with the correct length (33,34). Bearing 
this in mind, we corrected for those apparent heterozygous 
genotype calls in which the copy numbers differed by one 
or two copies, e.g. 1 1 and 10 copies, and for which the num- 
ber of reads was significantly different between both alleles, 
e.g. 1 5 and 2 reads, respectively. In this example, the 10-copy 



allele is likely a stutter product of a homozygous genotype 
of 11 copies. As a consequence, the 10-copy allele should 
be corrected to an allele with 1 1 copies. We confirmed such 
stutter artifacts for 78 out of 108 (72%) apparently heterozy- 
gous genotypes for which two alleles had copy number dif- 
ferences of one or two (see Supplementary Table S2). We 
also noticed that in most cases of these erroneously called 
'stutter alleles' there was at least a two-fold difference in the 
read numbers of the true allele and its stutter, with the stut- 
ter always having less reads. Based on these results we de- 
duced a 'stutter correction rule' for each repeat with a dif- 
ference of one or two copies between alleles; both alleles are 
considered as true alleles only if the percentage of the lowest 
read number relative to the highest one, (R2/Rl)x 100%, is 
>50%, which we called the read number imbalance (I). For 
the given example 7= (2/15)x 100%= 13.33%, which thus is 
below the 50% threshold, meaning that the genotype should 
be corrected for a stutter and thus becomes a homozygous 
genotype with 1 1 copies. Following this rule we in silico cor- 
rected 64 out of 108 tested genotypes (59.3%) with I < 50%, 
and from these, 55 (86%) were confirmed and only 9 (14%) 
were truly heterozygous (Supplementary Table S2), demon- 
strating that stutters could reliably be detected based on the 
read number imbalance rule (7). However, from the remain- 
ing 44 with I > 50% that were not corrected, 23 were also 
confirmed as stutter cases, which thus are false negatives 
(21%) that escape our correction rule. Setting the threshold 
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higher reduced the number of false negatives but also in- 
creased the number of false positives. In order to maximize 
the number of heterozygous calls, we opted to keep the / > 
50%. 

The main reason why we could not correct some stutter 
products is that the total coverage for some repeats was sim- 
ply too low to be able to differentiate true alleles from stutter 
products by read depth imbalance. Moreover, the low num- 
ber of reads can also prevent the detection of both alleles in 
a true heterozygous position. We thus implemented an ad- 
ditional filter step stating that the total read number (RT) 
should be above a certain threshold, which we arbitrarily 
set to RT > 4 in all seven individuals. 

We also identified a few genotype calls for which the to- 
tal read number was higher than the sum of the reads of 
two alleles (RT>R1+R2). For example, a heterozygous re- 
peat being called with CI = 49 and C2 = 59, and read num- 
bers Rl = 5 and R2 = 2 (RT = 7) was found by fragment 
analysis to consist of alleles with 49 and 60 copies. Anal- 
ysis of the aligned reads revealed that a wrongly heterozy- 
gous genotype was obtained because the longest allele (60 
copies) had only one read while its stutter (59 copies) had 
two (Supplementary Figure S7). Out of 184 genotypes (368 
alleles) that we validated by fragment analysis, we detected 
24 cases (6.5%) where one of the true alleles was missing, be- 
ing replaced by a stutter in the final genotype call. Following 
a conservative approach, we filtered out every genotype call 
for which RT^R1+R2, because this data inconsistency is 
most likely caused by stutter products or misaligned reads. 

After applying the stutter correction and read number 
(RT) filtering steps to our initial sequencing data, we cor- 
rected 23 804 genotypes in the seven samples (3000-3600 
genotypes per sample) and filtered out 9092 repeats. Then, 
from the 1654 repeats passing all filters as explained above 
(for each family member), we selected 10 random repeats to 
perform validation of the genotype calls by fragment anal- 
ysis. We did not find any erroneous calls in any of the seven 
family members (70 genotypes) (Supplementary Table S3). 
However, all 10 randomly selected repeats happened to be 
monomorphic (the same homozygous copy number in all 
seven family members). To also test our filtering criteria 
on the polymorphic repeats, we randomly selected 10 ad- 
ditional repeats from the polymorphic dataset and we val- 
idated those in all family members. Fragment analysis re- 
vealed an error rate of 7.9% (1 1/140 alleles). Taken together, 
despite the difficulties and technical limitations, the overall 
success rate in allele calling on our final dataset after filter- 
ing was estimated at 99.4% (Supplementary Table S3). 

Patterns of variability in tandem repeats 

In order to estimate the variability of tandem repeats in the 
human genome, we selected only those repeats that passed 
our filtering criteria and were genotyped in every individual 
of the family (1654 repeats). For each repeat, we measured 
its 'variability' in three different ways: first, we calculated 
the standard deviation of the alleles (i.e. their copy numbers) 
observed across the seven individuals (Figure 4); second, we 
counted the number of heterozygous individuals (Figure 5); 
finally, we counted the number of repeats with at least one 
allele that is different from the reference sequence observed 



over the seven individuals. We found variants different from 
the reference genome in 125 repeats (7.6%), hence regarded 
as polymorphic repeats. This variation was only 3.85% for 
minisatellites, but 9.95% for microsatellites showing that the 
variability of microsatellites is higher, as expected. Varia- 
tion revealed positive correlation with the total length of 
tandem repeats, with their copy number and sequence pu- 
rity (Supplementary Figure S8). We also found that based 
on the level of heterozygosity, as expected, repeats in cod- 
ing regions are significantly less variable (P < 0.01) than re- 
peats in introns, intergenic regions, 3' UTRs, and noncoding 
RNAs (Figures 4 and 5). 

From the 640 repeats in coding sequences genotyped in 
the seven individuals, 26 (4.06%) were polymorphic and 
thus were predicted to result in changes of a polypeptide 
length. For 17 repeats we examined the source of varia- 
tion and its effect on the amino acid sequence of a corre- 
sponding protein (Supplementary Table S4). As one repeat 
is a pentanucleotide, deletion of one repetitive unit causes 
a frameshift in GPR126, which generates a stop codon, at 
least in some transcripts, soon after this variation. The unit 
lengths of the remaining 16 repeats were multiples of 3 bp 
thus causing insertion or deletion of one or more amino 
acids, mostly within a poly amino acid tract. For four of 
the latter variations, the PROVEAN protein software pre- 
dicted a deleterious effect on the function of the proteins 
TAF7L, HEG1, ODFP1 and HYDIN (Supplementary Ta- 
ble S4). Analysis of these 17 variations by fragment analysis 
confirmed all except one, which validated our stringent fil- 
tering criteria. As a result, one frameshift and three poten- 
tial harmful in-frame variants were detected in this family. 

Interestingly, repeats in 5' UTRs seem highly conserved 
(Figure 4), which, however, could be due to the fact that 
we only obtain full reads for 1 1 repeats in all seven family 
members. 

We also performed a correlation analysis of the variability 
predicted by the SERV score and the observed variability 
in our data but this correlation was only 0. 13. However, we 
verified that the average SERV score for repeats with at least 
one heterozygous individual was significantly higher than 
that for conserved (monomorphic) repeats (Wilcoxon rank 
test, P — 2.925 x 10~ 8 ), thereby confirming the accuracy of 
the SERV algorithm to predict the variability of a tandem 
repeat (Supplementary Figure S9). 

Our selection of repeats for sequencing was biased in two 
ways, because we specifically targeted potentially polymor- 
phic repeats as well as repeats located within functional re- 
gions. To estimate how this bias affects the observed varia- 
tion rate, we analyzed the variability of four groups of re- 
peats that represent combinations of the above mentioned 
two factors: (i) repeats predicted as non-polymorphic and 
located within regions without a clear biological function 
(gene deserts); (ii) repeats that are predicted to be poly- 
morphic and that are located in gene deserts; (iii) repeats 
that are predicted to be non-polymorphic and are located 
within functional regions and (iv) repeats that are predicted 
to be polymorphic and are located within functional re- 
gions. When we extrapolate the variability for each of the 
four repeat categories in our dataset to the whole genome 
(taking into account the proportion of each repeat class in 
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the full genome), we estimate about 9.3% of all repeats to 
be polymorphic in this family (Supplementary Table S5). 

De novo tandem repeat variation 

Based on the final filtered dataset of 1 654 tandem repeats we 
looked for Mendelian inconsistencies within the four trios 
we analyzed. After applying our set filtering criteria (/ > 
50% and >5 reads in all individuals) we were left with only 
55 repeats that showed an apparent Mendelian inconsis- 
tency in at least one trio. Fifteen of those repeats were tested 
by fragment analysis, but none of these proved to be a true 
de novo mutation. From the 15 candidate repeats tested, in 
seven cases a stutter was causing an erroneous heterozygous 
call even with / > 50%. The remaining eight true heterozy- 
gous calls, with / < 50%, were erroneously corrected into 
homozygotes marking them as false positives after the cor- 
rection step. Based on these validation data, the remaining 
40 repeats with apparent inconsistencies were manually ex- 
amined, and revealed that they fit one of both types of errors 
described above. 

Though no de novo events were detected in the final cor- 
rected and filtered dataset, one de novo mutation was iden- 
tified and confirmed by fragment analysis before these cor- 
rection steps. The AAGA tetranucleotide repeat at Chrl:84 
267 437-84 267 512 showed an inconsistency of inheritance 
in childl (Chi). The called alleles (CI and C2) in the mother 
were 17 and 20 while Chi carried alleles with copy num- 
bers 18 and 19. The 18 -copy allele was inherited from his 
father indicating that the 19-copy allele should have been 
derived from a maternal allele i.e. by contraction of the 20- 
copy allele with one copy or by expansion of the 17-copy 
allele with two copies (Figure 6A). Validation of this repeat 
by fragment analysis confirmed the allelic inconsistency in 
the mother and her son. To prove that the fragment length 
difference was indeed caused by a copy number change, we 
subcloned the alleles from the mother and childl and con- 
firmed the de novo variation by regular Sanger sequencing 
(Figure 6B). 

DISCUSSION 

Like other genomes, the human genome is scattered with 
tandem repeats. Using a combination of repeat databases 
and de novo searches with a commonly used algorithm for 
repeat detection, we identified almost 800 000 tandem re- 
peats in the human genome including mononucleotide re- 
peats, microsatellites and minisatellites. More than 6000 of 
these repeats are located within coding regions, and almost 
half of these loci in the human genome are located near or 
within genes where variation in the repeats might have func- 
tional consequences. 

In this study, we describe a method to simultaneously 
characterize thousands of selected tandem repeats in hu- 
mans. More specifically, we first developed a strategy for 
the selection of tandem repeats that can be captured using 
unique flanking sequences, then captured about 10 000 se- 
lected repeats from the genomes of seven family members 
followed by massive parallel sequencing of those by 454 GS- 
FLX+ technology. Finally, we mapped the millions of reads 
with subsequent determination of the repeat copy numbers 



using our own software. Lastly, we developed data filtering 
algorithms based on validation by fragment analysis and 
Sanger sequencing. Our study of this three-generation fam- 
ily demonstrated that overall, 7.6% of tandem repeats are 
polymorphic. Importantly, 4.06% of repeats located within 
coding regions have an allele that differs from the reference 
sequence. Finally, of the 4447 repeats called in all seven fam- 
ily members, we found one example of a de novo variation 
that illustrates the instability of these regions. 

A few recent studies have contributed to a better view 
of repeat variability in the human genome, but they were 
significantly limited in the number of targets to study be- 
cause, due to technical limitations, they could focus only on 
short repeats. A genome-wide population-scale microsatel- 
lite analysis of >500 individuals of the 1000 Genomes 
Project exome sequencing pilot study, performed on 454 
and/or Illumina instruments, was analyzed for 8342 repeats 
with the majority (94.5%) <20 bp in total length (35). Simi- 
larly, whole genome paired-end Illumina sequencing in 200 
Drosophila inbred strains (36) or human gastric cancer cell 
lines and primary tissues (37), as well as the targeted cap- 
ture MiSeq sequencing of thousands of selected short tan- 
dem repeats (38) had to be restricted to short microsatel- 
lites. However, sequencing longer reads is especially crucial 
to study tandem repeat variability because this variability 
increases with repeat length (39,40). Our success in geno- 
typing a larger number of coding and potential regulatory 
tandems thus is mainly due to the longer reads that were ob- 
tained by the GS-FLX+ Roche technology (up to 700 bp) by 
which we could increase the actual repeat size up to 250 bp. 
In previous reports, the GS-FLX system with maximal read 
lengths of 450 bp had been validated for efficient genotyping 
of a selection of five microsatellites in 10 individuals (41), 
or a much larger pool of tandem repeats in several different 
microbial species (42). In our study, we enlarged the selec- 
tion of tandem repeats by number (>10 000) and range of 
characteristics (both micro- and minisatellites, total repeat 
size up to 250 bp, low and high copy numbers included) for 
a more comprehensive analysis in seven individuals of the 
same family. 

Our method for selective capture of tandem repeats by 
flanking, spanning and special probes is relatively efficient. 
It was demonstrated that spanning probes, which we de- 
signed to capture short-length repeats, do not compromise 
accuracy. In addition, we showed that a novel capture probe 
design including both flanks of the repeat (called special 
probes) can also capture a repeat with an efficiency com- 
parable to that of flanking probes. The use of a similar cap- 
ture method with spanning probes for eight repeat motifs 
(42), or with flanking probes for 7851 target loci (38) have 
recently been described. In the latter case, the capture effi- 
ciency was 38.7%, while in our study the combined usage 
of up to four types of probes allowed us to reach a cap- 
ture efficiency >60%. In the latter reported study, however, 
only 2.2% of all HiSeq reads completely spanned the repeat, 
thereby severely limiting the use for large-scale genotyping 
(38). Our study showed that longer reads yielded a very sig- 
nificant increase in sequencing efficiency: specifically, about 
25% of all sequencing reads covered a complete repeat se- 
quence together with 3' and 5' flanking regions that allowed 
mapping the repeat. We also found a bell-shaped relation- 
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Figure 6. De novo generation of a tandem repeat allele in childl. (A) Fragment analysis view of the alleles for the repeat chrl:84 267 437-84 267 512 in the 
father, mother and childl demonstrating that the 19-copy allele in childl occurred de novo, most likely by a contraction event. (B) Alignment of the Sanger 
sequenced cloned alleles of the mother (Mo) and childl (Chl ). Copy numbers of the (imperfect) AAGA repeat are indicated below. 



ship between the sequencing coverage and GC content with 
a preference to moderate GC content as has been reported 
by others (38). Despite our attempt to solve this issue by 
doubling the number of probes for the loci with both low 
and high GC content (<40% and >70%), the data showed 
that this procedure is not sufficient to capture repeats in GC- 
rich genomic regions. 

The biggest technical challenge encountered with tandem 
repeat genotyping is PCR stuttering, which can lead to er- 
roneous calls of the copy number for a given tandem repeat. 
Recently, Highnam and colleagues developed new software 
to deal with this problem by estimating probabilities for er- 
rors to appear in repeats with a specific set of character- 
istics. Applying these probabilities to the genotype calls of 
the respective repeats should allow making better decisions 
whether a variant is more likely to be a stutter artifact or 
a real allele (23). This predictive approach requires prior 
knowledge of the likelihood to generate stutters for each 
type of repeat, which unfortunately was lacking. Therefore, 
we deduced the filtering rules for stutter correction based 
on fragment analysis validation data. The only way to dis- 
tinguish a real allele from a stutter artifact was to compare 
the read frequency with that of the main allele. The relia- 
bility of this proportional approach in general is very good 
but obviously increases with a higher total number of reads 
per locus. 

After these filtering steps, 92.4% of the sequenced repeats 
turned out to be monomorphic, leaving 7.6% classified as 
polymorphic in the studied family (having at least one al- 
lele different from the reference sequence). This variation 
in copy numbers was unevenly distributed between min- 



isatellites (3.85%) and microsatellites (9.95%). The latter 
rate is higher than the 5.9% estimated for microsatellites by 
Mclver et al. in the CEU population from the 1000 Genome 
project (35). We anticipate, however, that the true percent- 
age of variability is somewhat lower due to stutters that still 
escape our correction step. On the other hand, our study 
only comprised seven individuals from one family consist- 
ing of only three unrelated individuals (GF, GM and Fa). 
Sequencing more unrelated individuals would therefore in- 
crease the observed repeat variability. Even though our ini- 
tial selection of targeted repeats was biased toward those 
with SERV scores > 1 .0, i.e. predicted to be more likely vari- 
able, the selection was also based on their location in the 
genome as an indicator for their potential to induce phe- 
notypic variation. Therefore, natural selection against un- 
favorable variants might also have influenced the observed 
variability in our set of tandem repeats. The very high occur- 
rence of monomorphic repeats and thus high stability of the 
copy number could be explained by a (yet unknown) phe- 
notypically important localization of the repeats, especially 
taking into account that 70% of the 10 746 selected repeats 
belong to the presumably functional (RF) group. 

The genome-wide variation of tandem repeats was esti- 
mated by extrapolation of the observed percentage of poly- 
morphisms among repeats with SERV scores > 1 versus 
SERV scores < 1 , and in repeats from functional (RF) ver- 
sus non-functional (RI) groups. Such approach yields an 
estimation of the overall portion of polymorphic repeats 
at about 9%. This variability rate is 7.4 times higher than 
that observed for SNPs (1 .25%) (43), though it was expected 
to be up to 1000-fold higher (2). A potential explanation 
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might relate to the unclear definition of tandem repetitive 
sequences. Thereby, a large number of sequences recognized 
as repeats in the genome might be still fairly resistant to 
DNA slippage during replication due to specific features 
such as repeat imperfectness, long repeat motifs or low copy 
number. In any case, our findings show a much higher level 
of variability compared to the previous estimate in tandem 
repeats of 1% (35). Although larger scale genotyping on un- 
related individuals would be needed to confirm our data, the 
percentage predicted in our study is encouraging to further 
investigate the variability of tandem repeats and its conse- 
quences on different traits. 

Interestingly, we also detected one de novo mutation in 
this small family. This sequence change is most likely due to 
a contraction of a maternal allele with one repetitive unit al- 
though we cannot exclude a two-unit expansion of the other 
maternal allele. However, since a one-unit change is more 
likely than a gain of two units at once (44,45), the former op- 
tion is preferred. When focusing on repeats located within 
the coding part that were sequenced in all seven individu- 
als, we found 17 repeats for which at least one variation was 
detected when compared to the reference sequence. Four 
of these variants are potentially deleterious, which demon- 
strates that our strategy is capable of detecting protein alter- 
ing variants with possible functional consequences. How- 
ever, it is clear that subsequent functional studies are re- 
quired to determine the impact of each of these repeat vari- 
ants. 

Because repeats are mostly ignored in today's compar- 
ative genomics and GWAS studies, they could be partly 
responsible for the so-called "missing heritability", i.e. in- 
stances where a phenotype has a clear genetic basis but 
where no genetic aberration could be found. The capture 
and sequencing strategy described in this study may provide 
a stepping stone for routine genome-scale repeat character- 
ization. It is worth mentioning that our method is not re- 
stricted to humans but can be applied for a comprehensive 
analysis of repeats in any species with a reference genome. 
With the constant improvements in read length and se- 
quencing cost reductions, we expect that this method can 
be applied for larger sets of repeats and with deeper cover- 
age, increasing its accuracy and cost effectiveness. Emerging 
long-read NGS technologies such as PacBio, Ion Proton or 
Nanopore can provide the throughput needed to make this 
happen within the next year. 

In summary, we report on a family-based analysis of 10 
000 selected tandem repeats using long sequence reads that 
yielded the complete repeat sequences in 25% of reads. Us- 
ing this method, we provide a better estimate of their vari- 
ability and demonstrate the occurrence of a de novo muta- 
tion event. This novel method provides major opportunities 
for genome-wide population-based genotyping for the as- 
sociation of repeat variability with common traits as well as 
disease. 
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